parallel computations. Wrongly, as we shall argue, since the laws of physics
intrude forcefully when we want to obtain realistic estimates of the
performance of parallel or distributed algorithms. First, we shall explain
why it is reasonable to abstract away from the physical details in sequential
computations. Second, we show why certain common approaches in
the theory of parallel complexity do not give useful information about the
actual complexity of the parallel computation. Third, we give some [...]
complexity of distributed computations.

1. Introduction


The  earliest  electronic  computing  engines  arose  as  a  byproduct  of  the  Manhattan  Project 
in  World  War  II.  Broadly  speaking,  their  purpose  was  to  compute  numerical  solutions  to 
second  order  partial  differential  equations  arising  in  connection  with  the  design  of  the 
atomic  bomb.  The  machines  consisted  of  primitive  logical  and  memory  components  like 
electromagnetic relays and mercury delay lines, which were wired up so as to have the
complex perform the desired computation. The architecture reflected the type of algorithm
to be performed, i.e., the solution of the mentioned equations by numerical grid
methods. Such algorithms suggest parallel or pipelined execution, and that is exactly the
type of architecture of those first computers [Gol72]. Only at the present time, in the
middle eighties, have we come full circle and seen such special purpose architectures again
in the pipelined and systolic algorithms frozen in the silicon hardware of chips. Once
more,  the  shift  is  away  from  sequential  thinking  in  the  form  of  line-by-line  programs  of 
imperative  or  other  nature,  and  to  representing  algorithms  in  structures  of  space  and 
time. 

After the Manhattan Project had been fulfilled, computer designers quickly progressed
to the idea of automating all types of computational tasks. Rather than stooping
to the chore of rewiring a new complex for every new task which came along, the idea
arose of letting the computer take over that job as well. Thus, the idea of a general purpose
computer entered the scene. It so happened that mathematicians like H.H.
Goldstine,  J.  von  Neumann  and  A.W.  Burks  were  well  aware  of  A.M.  Turing’s  brilliant 
1936  paper  [Tur36]  in  which  he  described  an  architecture  for  just  such  a  hypothetical 
machine: 

“Computing  is  normally  done  by  writing  certain  symbols  on  paper.  We  may  suppose  this 
paper to be divided into squares like a child’s arithmetic book. In elementary arithmetic the
two-dimensional character of the paper is sometimes used. But such use is always avoidable, and I
think that it will be agreed that the two-dimensional character of paper is no essential of
computation. I assume then that the computation is carried out on one-dimensional paper, i.e., on a tape
divided  into  squares.  I  also  suppose  that  the  number  of  symbols  which  may  be  printed  is  finite.” 

“The  behaviour  of  the  [human]  computer  at  any  moment  is  determined  by  the  symbols  he  is 
observing,  and  his  ‘state  of  mind’  at  that  moment.  We  may  suppose  that  there  is  a  bound  B  to 
the  number  of  symbols  or  squares  which  the  computer  can  observe  at  one  moment.  If  he  wishes 
to  observe  more,  he  must  use  successive  observations.  We  will  also  suppose  that  the  number  of 
states  of  mind  which  need  be  taken  into  account  is  finite.” 

“We suppose [above] that the computation is carried out on a tape; but we avoid introducing
the “state of mind” by considering a more physical and definitive counterpart of it. It is
always  possible  for  the  computer  to  break  off  from  his  work,  to  go  away  and  forget  all  about  it, 
and  later  to  come  back  and  go  on  with  it.  If  he  does  this  he  must  leave  a  note  of  instructions 
(written  in  some  standard  form)  explaining  how  the  work  is  to  be  continued.  This  note  is  the 
counterpart  of  “the  state  of  mind.”  We  will  suppose  that  the  computer  works  in  such  a  desultory 
manner  that  he  never  does  more  than  one  step  and  write  the  next  note.  Thus  the  state  of  progress 
of  the  computation  at  any  stage  is  completely  determined  by  the  note  of  instructions  and  the 
symbols  on  the  tape.  That  is,  the  state  of  the  system  may  be  described  by  a  single  expression 
(sequence of symbols), consisting of the symbols on the tape followed by Δ (which we suppose not
to appear elsewhere) and then by the note of instructions. This expression may be called the
“state formula.” We know that the state formula at any given stage is determined by the state
formula  before  the  last  step  was  made,  and  we  assume  that  the  relation  of  these  two  formulae  is 
expressible  in  the  functional  calculus.  In  other  words,  we  assume  that  there  is  an  axiom  A  which 
expresses  the  rules  governing  the  behaviour  of  the  computer,  in  terms  of  the  relation  of  the  state 
formula  at  any  stage  to  the  state  formula  at  the  preceding  stage.  If  this  is  so,  we  can  construct  a 
machine  to  write  down  the  successive  state  formulae,  and  hence  to  compute  the  required 
number.” 

Grasping the implied architectural concept, and improving it according to the leeway
provided by physical law, Burks, Goldstine and von Neumann in 1946 wrote a
memorandum  [Bur46]  which  shaped  the  architecture  of  electronic  computers  for  the  next 
forty  years.  This  memorandum  was  preceded  by  the  famous  ‘First  Draft’  [Neu45],  where 
we  can  clearly  distinguish  the  serial  mode  of  operation  of  the  modern  computer,  i.e.,  one 
instruction  at  a  time  is  inspected  and  then  executed.  This  is  in  sharp  distinction  to  the 
parallel operation of the earlier ENIAC computer in which many things were simultaneously
being performed. To abandon all parallelism was not thought of as detrimental to
performance,  since  the  potential  speed  of  the  electronic  techniques  was  judged  to  be  fast 
enough. Complainants about the ‘von Neumann’ bottleneck (explained below), inherent
in the stored program sequential computer as we know it, should realize that the conceptual
advantage of this scheme is what made possible the giant strides of progress: if cars
had become as much cheaper as computing power has, a car would cost less than 1 dollar.

Turing’s  analysis  of  the  process  of  computation  as  the  sequential  execution  of  a 
sequence  of  operations  is  so  natural,  that  it  seems  as  if  Euclid  in  designing  one  of  the 
earliest  known  algorithms  (for  computing  the  greatest  common  divisor)  must  have  had 
such  an  architecture  in  mind.  Now  it  so  happens,  that  in  sequential  computation  we  can 
ignore many physical details of the underlying computer system in analysing the computational
complexity of some program. Each operation essentially consists of a sequence
of “fetch from memory,” “execute operation on one or more operands in the Central
Processing Unit” and “store in memory.” The CPU operations can be thought of - when
viewed from sufficient distance - as essentially finite automata transitions which
transform input obtained by a bounded number of “fetch from memory” operations (say
2) into output in the form of “store in memory” operations (say 1). In the usual setup,
a memory register has a fixed length (say 48 bits) and both the memory accesses and
CPU operations take a fixed time (say, at most X). Therefore, a sequence of n operations
takes in between nX and 4nX time. Forgetting about the X and the small constants
like 4, it is usual to say that n operations take n ‘time.’ Note, here ‘time’ means
number  of  steps.  Similarly,  it  is  assumed  that  all  objects  manipulated  fit  in  a  single 
memory  location.  Moreover,  that  each  object  is  ‘random  accessible,’  that  is,  each  object 
can  be  accessed  as  fast  as  any  other.  This  is  referred  to  as  the  ‘unit  cost  measure.’ 

This scheme is sometimes refined to take into account that some items being manipulated
do not fit in a 48 bit register - as for instance the 123rd Mersenne prime. It is
then customary to charge the cost of manipulating the item as being linear in its length,
both in terms of storage and in terms of time for execution of an operation. This is
referred to as the ‘logarithmic cost measure.’ It is clear that this time cost measure is
only a lower bound, since the actual operations performed on the items when they are
chopped up often require more than time linear in the length of the items. For
instance, while logarithmic cost may be reasonable for addition, it is not reasonable for
multiplication.
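
The difference between the two cost measures can be made concrete with a small sketch. The function names and the shift-and-add cost model below are assumptions chosen for illustration, not part of the text; the operand is the Mersenne prime 2^127 - 1, which does not fit in a 48 bit register.

```python
def unit_cost(num_operations):
    """Unit cost measure: every operation is charged one step."""
    return num_operations


def logarithmic_cost(*operands):
    """Logarithmic cost measure: charge one operation the total bit length of its operands."""
    return sum(max(1, x.bit_length()) for x in operands)


def schoolbook_multiply_bit_ops(a, b):
    """Shift-and-add multiplication, counting bit operations (hypothetical cost model).

    Each addition into the accumulator is charged the length of its longer operand.
    """
    ops = 0
    acc = 0
    shift = 0
    while b:
        if b & 1:
            term = a << shift
            ops += max(acc.bit_length(), term.bit_length(), 1)
            acc += term
        b >>= 1
        shift += 1
    return acc, ops


p = 2 ** 127 - 1  # a Mersenne prime, 127 bits long
product, bit_ops = schoolbook_multiply_bit_ops(p, p)
print("unit cost of one multiplication:       ", unit_cost(1))
print("logarithmic cost of one multiplication:", logarithmic_cost(p, p))
print("bit operations actually counted:       ", bit_ops)
```

Under these assumptions the counted work grows roughly quadratically in the operand length, whereas the logarithmic charge grows only linearly, which is the sense in which that charge is only a lower bound for multiplication.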

A  further  refinement  may  be  made  for  objects  not  held  in  ‘random  access’  memory, 
but  on  disk  or  mass  storage  devices  such  as  tapes.  There  an  operation  on  an  object  may 
involve  swapping  pieces  of  the  object  back  and  forth  from  disk  to  random  access 
memory, thus incurring a time overhead which may be orders of magnitude greater
than the time spent on manipulating in the CPU and random access memory. Think
about the sorting or merging of huge data files. The logarithmic cost measure tries to
take such an overhead into account by charging as the cost of a memory access also the
length of the memory address. As in the case of the registers, this can be only a very
crude lower bound on the actual cost. We thus distinguish a memory hierarchy, where
the access times of objects stored at different levels differ by orders of magnitude.

While  the  physical  aspects  of  computing  devices  can  thus  be  fairly  well  accounted 
for,  the  basic  unit  of  time  a  transaction  takes  does  not  vary  too  wildly  within  each  level 
we  have  distinguished.  It  is  therefore  more  or  less  justified  to  forget  about  the  details 
and  talk  only  about  the  number  of  operations  at  each  level  of  the  memory  hierarchy.  As 
we  will  see,  in  the  realm  of  nonsequential  computation  reality  can  not  be  ignored  to  such 
an  extent. 

Since  in  current  computers  the  time  of  a  basic  operation  in  the  CPU  is  generally  far 
lower  than  that  of  memory  accesses,  most  computations  are  memory  bound,  i.e.,  the 
time  spent  in  accessing  various  levels  in  the  memory  hierarchy  completely  dominates  the 
computation time. This is popularly called the ‘von Neumann’ bottleneck. Are the prospects
any brighter in the coming era of nonsequential computation?

2.  Space 

In  many  areas  of  the  theory  of  parallel  computation  we  meet  tree  structured  devices  or 
computations. 

(1). For instance, ‘parallel random access machines (PRAM’s)’ can at each point in their
computation spawn a couple of offspring PRAM’s to perform some subcomputations.
Broadly speaking, we can therefore imagine the computation as a binary tree
of processors. The ‘time’ the computation takes is then linearly related to the depth
of the tree.

(2). In [Mea80] this idea is translated into terms of ‘very large integrated circuits.’ In
Chapter 8 the authors show a bold picture of a complete binary tree, and explain
that such a tree, with processors in each node, is capable of solving NP-complete
problems like the ‘traveling salesman problem’ in linear time. This, on the grounds
that the processor at the root can send a copy of the problem instance to each of
the leaves, and each of the leaves can try one candidate solution. A simple scheme
can guarantee that each leaf tries a different solution, each solution is tried by some
leaf, and all answers are percolated upwards to the root. If positive answers win
over negative ones in the fan in, the answer the root receives is a solution if there is
one and ‘no solution’ if there is none.

(3). One of the currently flourishing parts of the theory of parallel computation is
‘NC-computation.’ A problem is in ‘Nick’s Class’ if it can be solved in polylogarithmic
‘time’  using  a  polynomial  number  of  processors.  Here,  ‘time’  means  the  length  of 
the  longest  chain  of  causally  related  steps. 
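
The tree scheme of item (2) can be sketched as a sequential simulation. This is only an illustration under stated assumptions: the function `tree_search`, the toy subset-sum instance and the convention of counting one step per tree level are all hypothetical, and the count is a number of steps, not physical time.

```python
from itertools import product


def tree_search(n, is_solution):
    """Simulate a complete binary tree of depth n whose 2**n leaves each test one candidate.

    Returns (answer, steps). steps counts tree levels traversed: n levels to send the
    instance down to the leaves plus n levels to percolate the answers up.
    """
    # Broadcast phase: the leaf reached by the path bits (b1, ..., bn) tries exactly
    # that candidate, so every candidate is tried by exactly one leaf.
    level = [bits if is_solution(bits) else None for bits in product((0, 1), repeat=n)]
    steps = n
    # Fan-in phase: a positive answer wins over a negative one at every internal node.
    while len(level) > 1:
        level = [left if left is not None else right
                 for left, right in zip(level[0::2], level[1::2])]
        steps += 1
    return level[0], steps


# Hypothetical toy instance: find a subset of the weights that sums to the target.
weights = [3, 5, 8, 13]
target = 16
answer, steps = tree_search(
    len(weights),
    lambda bits: sum(w for w, b in zip(weights, bits) if b) == target,
)
print(answer, steps)
```

The step count is linear in n only because the simulation, like the models above, charges one unit per tree level regardless of how far apart the processors are.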

All  of  the  above  models  may  say  something  about  the  parallelizability  of  algorithms 
for certain problems. This often takes the form of distributing copies of the entire problem
instance, or pieces of the problem instance, among an exponential number of processors
in a linear number of steps. Or, as in NC, among a polynomial number of processors
in  a  polylogarithmic  number  of  steps.  The  way  a  problem  instance  can  be  divided  and 
partial  answers  put  together  may  give  genuine  insight  into  its  parallelizability.  However, 
it  can  not  give  a  reduction  from  an  asymptotic  exponential  time  best  algorithm  in  the 
sequential  case  to  an  asymptotic  polynomial  time  algorithm  in  any  parallel  case.  At 
least, if by ‘time’ we mean time. This can be seen easily as follows. If the parallel algorithm
uses $2^n$ processing elements, regardless of whether the computational model
assumes bounded fan-in and fan-out or not, it cannot run in time polynomial in $n$,
because physical space has us in its tyranny. Viz., if we use $2^n$ processing elements of,
say, unit size each, then the tightest they can be packed is in a 3-dimensional sphere of
volume $2^n$. No unit in the sphere can be closer to all other units than a distance of
radius $R$,

$$R = \left( \frac{3 \cdot 2^n}{4\pi} \right)^{1/3}.$$

Modulo a major advance in physics, it is impossible to transport signals over $2^{\alpha n}$ ($\alpha > 0$)
distance in polynomial $p(n)$ time. In fact, the assumption of the bounded speed of light
suggests that the lower time bound on any computation using $2^n$ processing elements is
$\Omega(2^{n/3})$ outright. Or, for the case of NC computations which use $n^\alpha$ processors, $\alpha > 0$,
the lower bound on the computation time is $\Omega(n^{\alpha/3})$.
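
A short numerical sketch of this packing argument follows. The function names are hypothetical, and the normalisation (unit volume per element, unit signal speed) is an assumption made only to expose the growth rate; the radius is that of a sphere whose volume equals the number of elements.

```python
import math


def packing_radius(num_elements, unit_volume=1.0):
    """Radius of the sphere whose volume equals num_elements * unit_volume."""
    return (3.0 * num_elements * unit_volume / (4.0 * math.pi)) ** (1.0 / 3.0)


def signal_time_lower_bound(num_elements, unit_volume=1.0, signal_speed=1.0):
    """Time a signal needs to cover the distance R at a bounded speed."""
    return packing_radius(num_elements, unit_volume) / signal_speed


for n in (10, 20, 40, 80):
    print(n, signal_time_lower_bound(2 ** n))
```

Every three extra address bits multiply the number of elements by eight and therefore double the bound, which is the cube-root growth in the number of elements stated above.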
The situation is worse than it appears on the face of it. Consider an architecture
such as the binary $n$-cube. This is the network in which the nodes are identified by $n$-bit
names, and there is a communication edge between two nodes if their identifiers
differ in a single bit. Call this graph $C=(V,E)$. Let $C$ be embedded in 3-dimensional
Euclidean space, and let each node have unit volume. Let $x$ be any node of $C$. There
are at most $2^n/8$ nodes within Euclidean distance $R/2$ of $x$, where $R$ is as above.
Then, there are $\geq 7 \cdot 2^n/8$ nodes at Euclidean distance $\geq R/2$ from $x$. Construct a spanning
tree $T_x$ of $C$ of depth $\leq n$ with node $x$ as the root. The average Euclidean
length of a path from the root in $T_x$ is $\geq 7R/16$, and therefore the average Euclidean
length of an embedded edge in a path from the root in $T_x$ is $\geq 7R/16n$. This does not
give a lower bound on the average Euclidean length of an edge in $T_x$. However, using
the symmetry of the binary $n$-cube we can establish that the average Euclidean length
of the edges in the 3-space embedding of $C$ is $\geq 7R/16n$. We can prove this as follows.
(The hasty reader may skip the proof.)

Proof. Denote a node $a$ in $C$ by an $n$-bit string $a_1 a_2 \cdots a_n$, and an edge $(a,b)$
between nodes $a$ and $b$ differing in the $i$th bit by:

$$a = a_1 \cdots a_{i-1} \underline{a_i} a_{i+1} \cdots a_n$$

This means that an edge has two representations. Now we can express a set $I$ of isomorphic
mappings of $C$ to itself by (1) a cyclic permutation of the representation of
nodes and edges, followed by (2) complementation of the bits of the representations in a
given pattern. I.e., the isomorphism $(j, c_1 c_2 \cdots c_n) \in I$ maps the above edge $a$ to

$$b = b_{j+1} \cdots b_{i-1} \underline{b_i} b_{i+1} \cdots b_n b_1 \cdots b_j$$

with $b_k = a_k$ if $c_k = 0$ and $b_k = \bar{a}_k$ (= complement of $a_k$) if $c_k = 1$.

Consider the ensemble $S$ of spanning trees of $C$, each tree isomorphic with $T_x$
above, consisting of the $n 2^n$ trees $i(T_x)$ to which $T_x$ is mapped by the $n 2^n$ distinct
isomorphisms $i$ in $I$. For each edge $e$ in $T_x$ and each edge $e'$ of $C$ there are two distinct
isomorphisms $i_1$ and $i_2$ in $I$ such that $i_1(e) = i_2(e) = e'$. The average Euclidean
length of a path from the root in each tree $i(T_x) \in S$ ($i \in I$) is $\geq 7R/16$, so the average
Euclidean length of a path from the root taken over all trees $i(T_x) \in S$ ($i \in I$) is
$\geq 7R/16$ as well. Let the Euclidean length of an edge $e$ in the 3-space embedding of $C$
be $l(e)$. Then, for each edge $e$ of $T_x$:

$$\sum_{i \in I} l(i(e)) = 2 \sum_{e' \in E} l(e')$$

That is, each edge in the embedded $C$ occurs twice as the same edge of the canonical
tree $T_x$ in the form of the corresponding isomorphic edge in some tree in $S$. Therefore,
the average Euclidean length of the edges in trees in $S$, which correspond to a single
particular edge of $T_x$, equals the average Euclidean length of an edge in $E$. Let $P$ be a
path from the root in $T_x$ consisting of $|P| \leq n$ edges. Then, the average sum of the
Euclidean lengths of the edges in a path $i(P)$ from the root in all trees $i(T_x)$ ($i \in I$)
equals $|P|$ times the average Euclidean edge length in $E$:

$$\sum_{i \in I} \sum_{e \in P} l(i(e)) = 2 |P| \sum_{e \in E} l(e)$$

Consequently, the average Euclidean edge length in $E$ equals the average Euclidean
length of an edge in a path $P$ from the root in a tree in $S$, and is therefore $\geq 7R/16n$:

$$\frac{\sum_{e \in E} l(e)}{n 2^{n-1}} = \frac{\sum_{P \in T_x} \sum_{i \in I} \sum_{e \in P} l(i(e))}{\sum_{P \in T_x} n 2^n |P|} \geq \frac{7R}{16n}$$
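
The counting step on which the proof rests (a cyclic permutation followed by a complementation pattern sends any edge onto any other edge in exactly two ways) can be checked by brute force for a small cube. The helper names below are hypothetical, and the check is a sanity test for one value of n, not a proof.

```python
from itertools import product


def apply_iso(node, j, c):
    """Apply the isomorphism (j, c): rotate the bit string by j, then complement by pattern c."""
    rotated = node[j:] + node[:j]
    return tuple(r ^ ck for r, ck in zip(rotated, c))


def cube_edge_set(n):
    """All edges of the binary n-cube as frozensets of two n-bit tuples."""
    result = set()
    for node in product((0, 1), repeat=n):
        for i in range(n):
            other = node[:i] + (1 - node[i],) + node[i + 1:]
            result.add(frozenset((node, other)))
    return result


def count_maps(n):
    """For every ordered pair (e, e') count the isomorphisms (j, c) that map e onto e'."""
    all_edges = cube_edge_set(n)
    isos = [(j, c) for j in range(n) for c in product((0, 1), repeat=n)]
    counts = {}
    for e in all_edges:
        for j, c in isos:
            image = frozenset(apply_iso(v, j, c) for v in e)
            counts[(e, image)] = counts.get((e, image), 0) + 1
    return len(isos), len(all_edges), counts


n = 3
num_isos, num_edges, counts = count_maps(n)
print(num_isos, num_edges, set(counts.values()))
```

For n = 3 this enumerates the 24 mappings and the 12 edges and reports how often each ordered pair of edges is hit.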


Since there are $n 2^n / 2$ edges in the binary $n$-cube, this sums up to an amazing
total wire length $\sum_{e \in E} l(e)$ needed in the Euclidean 3-dimensional embedding of $C$ of

$$\sum_{e \in E} l(e) \geq \frac{7R}{32} \, 2^n = \frac{7}{32} \left( \frac{3}{4\pi} \right)^{1/3} 2^{4n/3}$$
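
As a plausibility check, the bound can be compared with the total wire length of one concrete layout. The layout below is an assumption chosen for simplicity (nodes on integer grid points, address bits dealt round-robin over the three coordinates, straight wires); it is not a layout taken from the text, and the function names are hypothetical.

```python
import math


def cube_edges(n):
    """Edges of the binary n-cube as pairs of node indices that differ in one bit."""
    return [(v, v ^ (1 << i)) for v in range(2 ** n) for i in range(n)
            if v < (v ^ (1 << i))]


def grid_embedding(n):
    """Hypothetical layout: node v occupies one unit cell at an integer grid point.

    Address bit k contributes 2**(k // 3) to coordinate k % 3, so distinct nodes
    land on distinct grid points.
    """
    positions = {}
    for v in range(2 ** n):
        coords = [0, 0, 0]
        for bit in range(n):
            if (v >> bit) & 1:
                coords[bit % 3] += 1 << (bit // 3)
        positions[v] = tuple(coords)
    return positions


def total_wire_length(n):
    """Sum of straight-line edge lengths in the grid layout."""
    positions = grid_embedding(n)
    return sum(math.dist(positions[a], positions[b]) for a, b in cube_edges(n))


def wire_length_lower_bound(n):
    """The bound 2**n * 7R/32, with R the radius of a sphere of volume 2**n."""
    radius = (3 * 2 ** n / (4 * math.pi)) ** (1 / 3)
    return 2 ** n * 7 * radius / 32


for n in (3, 6, 9, 12):
    print(n, total_wire_length(n), wire_length_lower_bound(n))
```

In this layout an edge in dimension k has length 2**(k // 3), so the total grows at the same exponential rate as the bound while staying above it for the values printed.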


Many network topologies are afflicted with this problem: $n$-dimensional cube networks,
fast Fourier networks, butterfly networks, shuffle-exchange networks, cube-connected
cycles networks, and so on. In fact, the arguments seem to hold for networks
with a small diameter which satisfy certain symmetry requirements. An example of a
network with small diameter which is not symmetric in this sense is the tree. The fact
that 7/8th of all paths from the root in a complete tree would have Euclidean length
$\geq R/2$ in a 3-space embedding does not imply that the average Euclidean length of an
embedded edge of the tree is larger than a constant. This is borne out by the familiar
H-tree layout [Mea80] where the average edge length is less than 3 or 4. However, in the
recently investigated ‘fat tree’ architectures the wire length will dominate again. In a
complete binary fat tree of depth $n$ and root at level 0, a node at level $i+1$ is connected
to a node at level $i$ by a ‘bundle’ of $2^{n-i}$ edges. Then, trivially, the average Euclidean
length of an edge in a path from the root equals the average Euclidean length of an edge
in the fat tree, leading to the result above.

Note.  Deriving  the  result  about  the  total  necessary  wire  length  for  embedding  the 
binary  n-cube,  we  did  not  make  any  assumptions  about  the  volume  of  a  wire  of  unit 
length, or the way they are embedded in space, as is usual [Ull84]. It is consistent with
the derived results that wires have zero volume, and that infinitely many wires can pass
through a unit 2-dimensional area. Such assumptions invalidate the arguments used elsewhere.
In contrast with other investigations, the goal here is to derive lower bounds on
the  total  wire  length  irrespective  of  the  ratio  between  the  volume  of  a  unit  length  wire 
and  the  volume  of  a  processing  element.  The  lower  bound  on  the  total  wire  length  above 
is  independent  of  this  ratio,  which  changes  with  different  technologies  or  granularity  of 
computing  components. 
Iterating the above reasoning, but now adding the volume of the wires to the
volume of the nodes, the greatest lower bound on the volume necessary to embed the
binary $n$-cube converges to a particular solution in between a total volume of $\Omega(2^{4n/3})$
and a total volume of, say, $O(2^{2n})$ if we charge a constant fraction of the unit volume
for a unit wire length. The lower bound $\Omega(2^{4n/3})$ ignores the fact that the added volume
of the wires pushes the nodes further apart, thus necessitating longer wires again. The
$O(2^{2n})$ upper bound holds under the assumption that wires of all lengths have the same
volume per unit length (not more than a constant fraction of the unit volume of a node).
In a later section I show that the latter assumption cannot always be made.
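
One possible reading of this iteration can be sketched numerically. The sketch rests on assumptions that are not in the text: the whole device is taken to fill a sphere of its total volume, the wire-length bound 2^n * 7R/32 is reapplied with the radius of that sphere, and `wire_fraction` is a hypothetical constant for the volume of a unit length of wire.

```python
import math


def embedding_volume(n, wire_fraction=0.1, iterations=200):
    """Iterate V <- node volume + wire volume until the value settles.

    The wire length is bounded using the radius of a sphere of the current volume V,
    so longer wires enlarge V, which in turn lengthens the wires.
    """
    nodes = 2.0 ** n
    volume = nodes  # start by ignoring the wires altogether
    for _ in range(iterations):
        radius = (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)
        wire_length = nodes * 7.0 * radius / 32.0
        volume = nodes + wire_fraction * wire_length
    return volume


for n in (60, 120, 300):
    print(n, math.log2(embedding_volume(n)) / n)
```

Under these assumptions the printed exponent, the base-2 logarithm of the volume divided by n, lies between 4/3 and 2 for large n, in line with the range stated above; the exact value depends on the assumed model and should not be read as the author's result.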

These  surprising  facts  are  a  theoretical  prelude  to  many  wiring  problems  currently 
starting  to  plague  computer  designers  and  chip  designers  alike.  Formerly,  a  wire  had 
magical properties of transmitting data ‘instantly’ from one place to another (or better,
to many other places). A wire did not take room, did not dissipate heat, and did not
cost anything - at least, not enough to worry about. This was the situation when the
number of wires was low, somewhere in the hundreds. Current designs use many millions
of wires (on chip), or possibly billions of wires (on wafers). In a computation of parallel
nature, most of the time seems to be spent on communication - transporting signals over
wires. Thus, thinking that the von Neumann bottleneck has been conquered by nonsequential
computation, we are unaware that the Non-von Neumann bottleneck is still
waiting. The following innominate quote covers this matter admirably:

“Without  me  they  fly  they  think;  But  when  they  fly  I  am  the  wings.” 

Another  effect  which  becomes  increasingly  important  is  that  most  of  the  room  in 
the  device  executing  the  computation  is  taken  up  by  the  wires.  Under  very  conservative 
estimates  that  the  unit  length  of  a  wire  has  a  volume  which  is  a  constant  fraction  of 
that  of  a  component  it  connects,  we  can  see  above  that  in  3-dimensional  layouts  for 
binary $n$-cubes, or for the other fast permutation networks, the volume of the $2^n$ components
performing the actual computation operations is an asymptotically fast vanishing
fraction of the volume of the wires needed for communication:

$$\frac{\text{volume computing components}}{\text{volume communication wires}} = O(2^{-n/3})$$

Today it seems that a partial solution to this problem can be found in optical communication,
either wireless by means of lasers/infrared light or by using virtually unlimited
bandwidth glass fiber. But beware, even while Nature is not malicious, she is subtle.


3.  Time 

It  is  useful  to  distinguish  between  distributed  computation  and  distributed  control. 
Whereas  the  former  is  concerned  with  the  distributed  solution  of  problems  for  which 
there  also  exist  sequential  algorithms,  the  latter  is  concerned  with  problems  which  make 
no sense in terms of sequential computation. Examples of the former are parallel algorithms
for matrix multiplication, fast Fourier transform, shortest path, matching.
Examples of the latter are methods for mutual exclusion and nameserver [Mul85],
distributed spanning tree, clock synchronization algorithms, Byzantine agreement, leader
election, symmetry breaking. In distributed control the notion of time plays an all-important
role.

As  large  multiprocessor  systems  communicating  by  message  passing  start  to  be 
actually constructed, and on a geographically grander scale very large computer networks,
synchronization problems connected with the operation of such complexes are
bound to become acute. Another problem which gets crucial for very large computer
complexes is the number of message passes. Without efficient congestion control the system
will be swamped by communication messages effectively blocking throughput.

To  fix  thoughts,  the  networks  we  consider  are  point-to-point  (store-and-forward) 
communication  networks  described  by  an  undirected  communication  graph,  with  the  set 
of  nodes  representing  the  processors  of  the  network,  and  the  set  of  links  representing 
bidirectional  noninterfering  communication  channels  between  them.  No  common 
memory is shared by the node processors. Each node processes messages received from
its neighbors, performs local computations on messages and sends messages to neighbors.
All these actions take a finite time. All messages have a finite length according to the
finite amount of information they carry. Each message sent by a node to its neighbor
arrives there in a finite time. A message pass consists of the sending of a message from
one node to one of its direct neighbors. In order to make the cost measure meaningful,
when we express the complexity of some algorithm in the number of message passes, we
want to exclude unrealistically long messages. One choice is to allow messages of size
$O(\log n)$, where $n$ is the number of nodes in the network. The time complexity of a distributed
algorithm should obviously be the size of the interval between the beginning
and the end of the algorithm. As yet there seems to be no completely satisfactory general
method to compute this cost constructively, given the algorithm, for the many types
of  distributed  algorithms  which  are  known.  However,  this  is  only  one  of  many  problems 
associated  with  the  concept  of  time  in  distributed  systems. 
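
To make the message-pass measure concrete, the sketch below counts message passes for a simple flooding broadcast on a point-to-point network. Flooding is used only as a familiar example and is not an algorithm from the text; the function names are hypothetical, and the round count assumes, for this sketch only, that every message pass takes exactly one time unit.

```python
from collections import deque


def hypercube(n):
    """Undirected communication graph of the binary n-cube as an adjacency dict."""
    return {v: [v ^ (1 << i) for i in range(n)] for v in range(2 ** n)}


def flood(graph, source):
    """Flooding broadcast over a store-and-forward network.

    Every node forwards the message once, to all neighbors except the one it first
    heard it from. Returns (nodes reached, message passes, rounds).
    """
    heard_from = {source: None}
    queue = deque([(source, 0)])
    message_passes = 0
    rounds = 0
    while queue:
        node, time = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == heard_from[node]:
                continue
            message_passes += 1  # one message sent over one link
            rounds = max(rounds, time + 1)
            if neighbor not in heard_from:
                heard_from[neighbor] = node
                queue.append((neighbor, time + 1))
    return len(heard_from), message_passes, rounds


reached, passes, rounds = flood(hypercube(4), source=0)
print(reached, passes, rounds)
```

Each message here carries only the broadcast item, so the count of message passes is meaningful in the sense above; with unbounded message sizes a single pass could hide arbitrary amounts of communication.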

Here  we  focus  on  problems  resulting  from  lack  of  synchronization.  These  can  be 
dealt  with  using  ‘partially  ordered’  time,  as  in  [Lam78],  or  by  constructing  algorithms 
that  can  deal  with  unlimited  asynchrony.  The  latter  algorithms  can  surely  deal  with 
any  environment  in  which  there  is  knowledge  about  processor  speed  and  message 
delivery  time.  Unlimited  asynchronous  models  have  been  thoroughly  investigated,  as 
have  purely  synchronous  models.  Physical  systems  are  usually  somewhere  in  between: 
they  are  neither  purely  synchronous  nor  unlimited  asynchronous.  It  is  therefore  an 
interesting  exercise  to  develop  algorithms  that  do  not  use  knowledge  about  the  relative 
progress of time in the system, yet perform better under realistic conditions about
time.  The  usual  logically  time-independent  algorithms  do  not  assume  anything  about 
the  rate  at  which  time  flows  in  different  locations.  This  is  unnecessarily  harsh  with 
respect  to  many  problems  arising  in  the  real  world.  Clock  drift  in  systems  happens  with 
a  certain  smoothness,  since  abrupt  changes  are  rare  in  nature.  It  seems  to  be  worthwhile 
to  investigate  robust  algorithms  such  that: 

• the algorithms remain correct and terminate under any behavior of time in the system,


•  using  time,  the  algorithms  are  yet  logically  time-independent,  only  their  efficiency 
depends  on  the  behavior  of  time, 

•  with  increasing  synchronous  well-behaved  time  in  the  system  the  performance  of 
the  algorithm  improves  ever  faster, 

•  if  the  asynchrony  of  the  system  is  known  then  the  algorithm  performs  as  well  as  in 
the  synchronous  case, 

• under practical assumptions about clock speeds these algorithms use fewer message
passes  than  is  possible  by  any  other  known  methods  for  the  problems  they  solve  in 
asynchronous  systems, 

•  the  limitation  on  unlimited  asynchrony  such  algorithms  require  is  but  a  minor  one 
which  is  generally  satisfied  and  which  we  term  “Archimedean  asynchronicity”. 

Now,  in  asynchronous  distributed  systems  each  processor  has  its  own  clock. 
Although  these  clocks  may  not  be  synchronized,  and  the  clocks  may  not  indicate  the 
same time, there should be some proportion between the clock rates. That is, if an interval
of time has passed on the clock for processor A, a proportional period of time has
passed on the clock for processor B.

Definition. A distributed system is Archimedean from time $t_1$ to time $t_2$ if the
ratio of the time intervals between the ticks of the clocks of any pair of processors, and
the ratio between the communication delay between any adjacent pair of processors and
the time interval between the ticks of the clock of any processor, is bounded by a fixed
integer during the time interval from $t_1$ to $t_2$. (This ratio need not be bounded a priori,
nor need it be known to the processors concerned.)
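
The definition can be illustrated by computing the bounding integer from observed measurements. This is a minimal sketch under assumptions: the function name and the data layout are hypothetical, all quantities are taken to be measured in one common time unit within the window of interest, and such a computation is available only to an outside observer, since the processors themselves need not know the bound.

```python
import math


def archimedean_bound(tick_intervals, link_delays):
    """Smallest integer bounding all ratios named in the definition, or None.

    tick_intervals: processor -> observed times between consecutive clock ticks
    link_delays:    (processor, processor) -> observed delays on that link
    """
    ticks = [t for samples in tick_intervals.values() for t in samples]
    delays = [d for samples in link_delays.values() for d in samples]
    if not ticks or min(ticks) <= 0:
        return None  # a stopped clock admits no finite bound
    # Tick interval of any processor against tick interval of any other processor.
    worst_tick_ratio = max(ticks) / min(ticks)
    # Communication delay on any link against tick interval of any processor.
    worst_delay_ratio = max(delays) / min(ticks) if delays else 0.0
    return math.ceil(max(worst_tick_ratio, worst_delay_ratio))


# Hypothetical observations for two processors and one link.
observed_ticks = {"A": [1.0, 1.5], "B": [0.5, 0.25]}
observed_delays = {("A", "B"): [3.0, 2.0]}
print(archimedean_bound(observed_ticks, observed_delays))
```

With these made-up numbers the slowest delay is twelve times the shortest tick interval, so the system is Archimedean over the observed window with bound 12.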

That is, in asynchronous networks the magnitudes of elapsed time should satisfy
the axiom of Archimedes. The axiom of Archimedes holds for a set of magnitudes if, for
any pair a, b of such magnitudes, there is a natural number n such that the multiple na
exceeds b. It is called Archimedes’ axiom*, possibly because of its application in
obtaining large numbers in The Sand-Reckoner.

We  assume  that  the  magnitudes  of  elapsed  time,  as  measured,  for  instance,  by  local 
clocks  amongst  different  processors  or  by  the  clock  of  the  same  processor  at  different 
times,  as  well  as  the  magnitudes  consisting  of  communication  delays  between  the  sending 
and  receiving  of  messages,  as  measured,  for  instance,  in  absolute  physical  time,  all 
together  considered  as  a  set  of  magnitudes  of  the  same  kind,  satisfy  the  Archimedean 
axiom.  In  physical  reality  it  is  always  possible  to  replace  a  magnitude  of  elapsed  time,  of 
any  clock  or  communication  delay,  by  a  corresponding  magnitude  of  elapsed  absolute 
physical  time,  thus  obtaining  magnitudes  of  the  same  kind.  We  assume  a  global  absolute 
time  to  calibrate  the  individual  clocks;  using  relative  time  by  having  the  clocks  send 
messages to one another yields the same effect, for the purposes at hand. If we do not
restrict ourselves, so to speak, to Archimedean distributed systems, then the processors
in the system may not have any sense of time. Or, they have clocks which keep purely
subjective time, so that the unit time span of each processor is unrelated to that of
another. That is, the set of time units is non-Archimedean if it is not the case that the
length of every time unit is less than a finite multiple of that of any other in the
absolute global time scale. Or, the communication delays have no finite ratio among
themselves or with respect to subjective processor clocks. As a consequence:

•  Any process, pausing indefinitely long with respect to the time-scale of the others,
between events like the receiving and passing of a message, and also any unbounded
communication delay, effectively aborts activities such as an election in progress. A
process can never be sure that it is the only one which considers itself elected.

•  Without physical time and clocks there is no way to distinguish a failed process from
one just pausing between events.

•  A user or a process can tell that a system has crashed only because it has been waiting
too long for a response.

*  In Sphere and Cylinder and Quadrature of the Parabola Archimedes formulates the postulate
as follows: “The larger of two lines, areas or solids exceeds the smaller in such a way that the
difference, added to itself, can exceed any given individual of the type to which the two mutually
compared magnitudes belong”. The axiom appears earlier as Definition 4 in Book 5 of Euclid's
Elements.

Distributed systems in the sense of physically distributed computer networks
communicate by sending signed messages and setting timers, or equivalent devices. If an
acknowledgement of safe receipt by the proper addressee is not received by the sender
before the timer goes off, the sender sends out a new copy of the message and sets a
corresponding timer. This process is repeated until either a proper acknowledgement is
received or the sender concludes that the message cannot be communicated due to
failures. Thus, clocks and timeouts are necessary attributes of real distributed systems
and non-Archimedean time in the system is intolerable outright. Whereas unlimited
asynchrony would prevent a system from functioning properly, pure synchrony in a
system cannot exist: the clocks of distinct processors drift apart in both indicated time
and running speed and have to be resynchronized by algorithms running in Archimedean
time as defined above.
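
The send, time out and resend cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a protocol taken from the paper: `send` is a hypothetical callable that transmits one copy of the message, `acks` is a queue on which acknowledgements arrive, and an acknowledgement is taken to be "proper" when it echoes the message.

```python
import queue

def send_with_timeout(message, send, acks, timeout_s=0.01, max_attempts=5):
    """Resend `message` until it is acknowledged or we give up.

    Returns the number of the attempt that was acknowledged, or None if the
    sender concludes that the message cannot be communicated.
    """
    for attempt in range(1, max_attempts + 1):
        send(message)                          # send a (new) copy of the message
        try:
            ack = acks.get(timeout=timeout_s)  # this blocking wait is the timer
        except queue.Empty:
            continue                           # timer went off: resend
        if ack == message:                     # proper acknowledgement received
            return attempt
    return None                                # conclude failure

# Usage: a hypothetical lossy link that only delivers the third copy.
acks = queue.Queue()
sent = []

def lossy_send(m):
    sent.append(m)
    if len(sent) == 3:
        acks.put(m)

attempts_needed = send_with_timeout("hello", lossy_send, acks)
```

The choice of `timeout_s` is where the Archimedean assumption enters: a finite timeout is only meaningful if communication delays are bounded by some multiple of the sender's clock ticks.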

We may call this concept of algorithms using physical time, instead of being
oblivious to physical time, one of time-driven algorithms. The use of such algorithms
would be in the area of distributed control in synchronous or asynchronous systems. Some
problems necessarily have time-driven algorithms, while the algorithms for other problems
may or may not be time-driven. For example, consider algorithms for clock synchronization
on the one hand, and for distributed spanning trees and distributed elections on the other:
the former are time-driven because of their very subject matter, while the latter may be
time-driven by design or not be time-driven at all. The primary goal of an investigation
into the feasibility of such algorithms in [Vit84, Vit85] was to demonstrate the existence
of competitive time-driven algorithms with the desirable properties as mentioned. These
algorithms were superior in terms of message passes. More significantly, they performed
better than allowed by known lower bounds on the number of message passes required in
asynchronous networks. Unfortunately, they were quite unrealistic in terms of running time.
Nonetheless, we expect that genuinely more efficient algorithms than the unlimited
asynchronous ones exist, in between the pure synchronous and unlimited asynchronous ones.
4. Physics

Apart from space and time, nature intrudes obtrusively in nonsequential computation in
the form of physics. We give an example from the field of VLSI taken from [Vit85'....].

In current chips, synchronization requirements slow down the computation to a
clocked switching time, which is of the order of the delay in the longest wire. As the
minimal feature width continues to decrease into the submicron range, this delay
governs overall performance more and more. In order to obtain very high speed
integration, one way to go is to obtain a propagation delay logarithmic in the length of
the wire, as in [Mea80]. Electronic considerations show [Mea82] that all wires then need to
have the same ratio between width and length, that is, the same aspect ratio. Below we
derive this fact, and show some of the consequences.

4.1.  Electronics 

Analysis of signal propagation delay in wires on chip requires different models in
different cases: transmission line, distributed RC and lumped RC. However, the
dominating factor on a densely packed chip is that a wire is not alone, but surrounded by
other wires. This fact leads to the following analysis [Mea82, Vit85?..].
The time it takes a minimum transistor to drive a wire of length $L$, width $W$ and
thickness $H$ can be estimated as follows. The wire is assumed to have distance $D_l$ to
neighbouring layers and $D_w$ to other wires in the same layer. If $W_0$ is the minimal width
of a wire in the current technology, then the minimal transistor, consisting of a wire
crossing, occupies area $W_0^2$. The total time $T$ to drive a wire is approximated by:

$$ T \approx (R_t + R_w)\, C_w \tag{1} $$

where $R_t$ is the resistance of the minimum transistor, $R_w$ the resistance of the wire and
$C_w$ its capacitance.

Therefore, the total time $T$ can be thought of as the sum of the time $T_d$ needed to
drive a zero resistance wire of capacitance $C_w$, and the time $R_w C_w$ needed to transport
the appropriate charge from a zero resistance source. Roughly, $T_d$ is the time needed to
transport the necessary charge through the bottleneck consisting of the switch (the
minimal transistor), and $R_w C_w$ is the time needed to distribute the charge appropriately
over the wire $w$. Since the resistance of a wire is proportional to its length and inversely
proportional to its cross section we have:

$$ R_w = \rho_w \frac{L}{WH} \tag{2} $$

where $\rho_w$ is the resistivity of the considered wire material. The capacitance of a wire is
inversely proportional to the distance of its neighbouring wires and layers, and
proportional to the area of the side facing that neighbouring layer or wire:

$$ C_w = 2 \epsilon_w L \left( \frac{W}{D_l} + \frac{H}{D_w} \right) \tag{3} $$

where $\epsilon_w$ is a proportionality constant consisting of the product of the permittivity of
free space and the dielectric constant of the insulating material (usually SiO2). Thus,

$$ R_w C_w = 2 \rho_w \epsilon_w \frac{L^2}{WH} \left( \frac{W}{D_l} + \frac{H}{D_w} \right) . \tag{4} $$

This suggests a signal propagation time quadratic in $L$. However, the resistance $R_t$ of
the minimum transistor dominates in (1) for the magnitudes of $L$ under consideration
(smaller than, say, 1 foot). We can decrease that term by fitting a larger driver transistor
to the wire. This transistor, in its turn, must be driven by the minimal transistor.
Iterating this scheme, cf. [Mea80], we obtain a sequence of transistors, of which each next
one is a factor $\alpha$ larger than the preceding one. The final transistor in the sequence
should be large enough to drive the wire in a sufficiently short time. (We can think of this
scheme as a sequence of switches where each switch serves to switch the next larger
switch, and the largest switch in the sequence controls the large channel through which
the charge is transported to the wire. Although the time to actually pass the appropriate
charge from source to wire can be made smaller by fitting a larger final driver transistor
to the sequence, there seems no way to get rid of the time needed to switch all
transistors in between the smallest transistor and the largest one.) The time to drive a
driver with capacitance $C_2$ by a driver with smaller capacitance $C_1$ is given by [Mea80]:

$$ \tau \frac{C_2}{C_1} \tag{5} $$

where $\tau$ is the time it takes a minimal transistor to charge the gate of another minimal
transistor. If $C_t$ is the capacitance of the minimal transistor then for a ramp of $r$
drivers:

$$ r = \log_\alpha \frac{C_w}{C_t} \tag{6} $$

taking $T_d = r \tau \alpha$ time to charge the wire if it had no resistance. The capacitance of
the minimum transistor is given by


$$ C_t = \epsilon_t \frac{W_0^2}{D_0} \tag{7} $$

where $D_0$ is the thickness of the gate insulator and $\epsilon_t$ is the product of the
permittivity of free space and the dielectric constant of the gate insulator. Thus we can
drive a zero resistance wire of capacitance $C_w$ through a sequence of $r$ drivers for fixed
$\alpha$ in time:

$$ T_d = \alpha \tau \log_\alpha \frac{C_w}{C_t} . \tag{8} $$

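From (1), (4) and (8) we obtain an expression for $T = T_d + C_w R_w$:

$$ T = \alpha \tau \log_\alpha \frac{C_w}{C_t} + 2 \rho_w \epsilon_w \frac{L^2}{WH} \left( \frac{W}{D_l} + \frac{H}{D_w} \right) . \tag{9} $$

In [Mea82] it was observed that by keeping the derivatives, with respect to $L$, of the two
terms $T_d$ and $C_w R_w$ balanced:

$$ \frac{\alpha \tau}{L \ln \alpha} = 4 \rho_w \epsilon_w \frac{L}{WH} \left( \frac{W}{D_l} + \frac{H}{D_w} \right) , \tag{10} $$

$T$ grows logarithmically in $L$. Viz., by assumption of equality (10) the second term of (9)
equals $\alpha \tau / (2 \ln \alpha)$, and we obtain:

$$ T = \alpha \tau \left( \log_\alpha \frac{C_w}{C_t} + \frac{1}{2 \ln \alpha} \right) . $$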


According to (9) we obtain logarithmic signal propagation delay by, all other things
being equal, keeping

$$ \frac{L^2}{WH} \left( \frac{W}{D_l} + \frac{H}{D_w} \right) = \text{constant} $$

rather than by just keeping $L^2$ proportional to $WH$ as in [Mea82]. Keeping the
interwire distance proportional to the wire width, and the interlayer distance
proportional to the wire height, we observe that if $W$, $H$ and $L$ are kept in proportion a
logarithmic propagation delay is attained. (Note that we cannot reach this effect by keeping
the wire width the same but using very ‘tall’ wires or vice versa.) The aspect ratio of a
wire is the quotient of its width and length. To obtain a logarithmic signal propagation
delay we thus need the fixed constant aspect ratio following from (9) and (10) for all
wires in the layout. In designing such a high speed layout we therefore need to install
drivers to drive the long wires and to design all wires with a constant aspect ratio $a > 0$.
Therefore, a wire of length $L$ in such a layout has area $aL^2$. The area taken by the
driver is linear in the length of the wire [Mea82]: the minimal transistor occupies area
$W_0^2$, the next driver area $\alpha W_0^2$, and so on for $\log_\alpha L$ terms for an $L$-length
wire. The total driver area for an $L$-length wire becomes $W_0^2 (L-1)/(\alpha-1)$. This area
is required at the lowest silicon layer of the chip; the long interconnect wires are
executed in the upper metal layers.
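
The scaling claims of this subsection can be checked numerically with a small Python sketch. It assumes that equation (4) reads R_w C_w = 2 ρ_w ε_w (L²/(WH)) (W/D_l + H/D_w), that the number of driver stages is log_α(C_w/C_t), that the ramp delay is α τ times that number, and that the driver area is the geometric sum W_0² (L-1)/(α-1). The function names and the unit values chosen for the material constants are illustrative assumptions.

```python
import math

def wire_rc(L, W, H, D_l, D_w, rho=1.0, eps=1.0):
    """Product R_w * C_w of wire resistance and capacitance, as in (4)."""
    return 2.0 * rho * eps * (L * L / (W * H)) * (W / D_l + H / D_w)

def ramp_stages(c_wire, c_min, alpha):
    """Number r of drivers in the ramp, as in (6): log_alpha(C_w / C_t)."""
    return math.log(c_wire / c_min) / math.log(alpha)

def ramp_delay(c_wire, c_min, alpha, tau):
    """Time to drive a zero resistance wire through the ramp, as in (8)."""
    return alpha * tau * ramp_stages(c_wire, c_min, alpha)

def driver_area(L, alpha, w0=1.0):
    """Total area of the ramp: W_0^2 * (1 + alpha + alpha^2 + ...), log_alpha(L) terms."""
    return w0 ** 2 * (L - 1) / (alpha - 1)

# Doubling only the length quadruples R_w * C_w ...
base = wire_rc(L=10, W=1, H=1, D_l=1, D_w=1)
longer = wire_rc(L=20, W=1, H=1, D_l=1, D_w=1)
# ... but scaling L, W, H and both distances together leaves it unchanged.
scaled = wire_rc(L=30, W=3, H=3, D_l=3, D_w=3)

# A ramp with alpha = 3 driving a load 81 times the minimal gate needs 4 stages.
stages = ramp_stages(c_wire=81.0, c_min=1.0, alpha=3.0)
area = driver_area(L=81, alpha=3)  # 1 + 3 + 9 + 27 in units of W_0^2
```

With these inputs the resistive term stays fixed under proportional scaling while the ramp delay grows only with the logarithm of the load, which is the effect the constant aspect ratio is meant to secure.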

The effect of having all wires in the layout with the same constant aspect ratio
spells disaster for circuits which necessarily have many long wires. This holds for trees,
but more so for fast permutation networks. However, let us look at what happens for
natural wire length distributions.


4.2.  Wire  Length  Distributions 

Let $f : N \to N$, connected with a VLSI layout, be a wire length distribution function
which yields the number $f(i)$ of wires of length $i$ in the design.

Every VLSI layout must have a constant bounded fan-in and fan-out of wires for
the components (transistors). If the chip area is $A$, then a reasonable assumption is that
the maximal wire length on a chip does not exceed

$$ L_{\max} = \sqrt{A} . \tag{11} $$

Consequently, the number of wires in the layout is given by

$$ \#\,\text{wires} = \sum_{i=1}^{\sqrt{A}} f(i) . \tag{12} $$

To achieve logarithmic propagation delay we can estimate and bound the layout
area occupied by the fattened wires as follows. Let $C$ be the amount of area of the
layout occupied by non-wire components such as transistors. Assuming that $C$ is also the
order of magnitude of the number of basic components like transistors or logic gates in
the circuit we can reason as follows. Since the wires only serve to connect components
we have $C \in O(\#\,\text{wires})$ in a connected layout. The components are assumed to have at
most a bounded number $t$ of connections to attach wires, which we suppose to account also
for the fan-in and fan-out of the interconnect wires. Therefore $C \in \Omega(\#\,\text{wires})$ and
consequently $C \in \Theta(\#\,\text{wires})$. Since we are primarily interested in orders of magnitude
in the sequel, we are justified to use $C$ interchangeably for the amount of area occupied by
the non-wire components, the number of non-wire components and the number of wires.
The maximal area occupied by the wires (and interwire distances) under (10) is bounded
by the available area:

$$ \sum_{i=1}^{\sqrt{A}} f(i)\, a\, i^2 \le A - C , \tag{13} $$

where $a$ is the constant quotient of width and length (the aspect ratio) of the connect
wires as required by (10). Using a simple theoretical argument and an experimental
study of actual layouts, [Don81] develops the following wire length distribution
relationship:

$$ f(i) = \lfloor c\, i^{-\lambda} \rfloor \quad (1 \le i \le L_{\max}) \qquad \text{and} \qquad f(i) = 0 \quad (i > L_{\max}) \tag{14} $$

for a normalization constant $c$ yet to be chosen. Here $c$ is a constant related to the
size of the array (rectangular chip) and the adequacy of the placement; and $\lambda$ is a
constant characteristic of the logic. Equation (14) is derived using “Rent’s Rule” which
states that the average number of terminals per complex of $C$ elements (in units,
modules, cards, gates etc.) is $tC^p$, where $t$ is the number of connections per individual
element and $p$ is the Rent constant characteristic of the logic complex. The analysis goes
by dividing a square array of cells into 4 equal square arrays recursively down until the
individual areas are the individual elements of the original logic. On each level of the
recursion the number of connections crossing boundary lines is determined using Rent’s
rule. This shows that $\lambda = 3 - 2p$. In [Don81] experimental results are given for some
actual layouts placed using a hierarchical placement program: layouts for high-speed
logic where $p$ was found to be 0.75 and a layout for a hand calculator chip with
$p = 0.59$. Let furthermore the network be connected, so the maximal amount of area
units $C$ available to place the components is not greater than the number of wires plus
1.
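
As a small worked example of the relation between the Rent constant and the wire length distribution, the Python sketch below evaluates the exponent 3 - 2p for the two measured values of p quoted above and tabulates the floor of c times i to the power minus the exponent for the first few wire lengths. The normalization constant c = 100 and the function names are arbitrary illustrative assumptions.

```python
import math

def rent_exponent(p):
    """Exponent of the wire length distribution for Rent constant p: 3 - 2p."""
    return 3.0 - 2.0 * p

def wire_counts(c, lam, l_max):
    """Number of wires of each length 1..l_max: floor(c * i**(-lam))."""
    return [math.floor(c * i ** -lam) for i in range(1, l_max + 1)]

lam_high_speed = rent_exponent(0.75)   # high-speed logic
lam_calculator = rent_exponent(0.59)   # hand calculator chip

# Wires of length 1, 2, 3, 4 for the high-speed logic exponent, with c = 100.
counts = wire_counts(c=100, lam=lam_high_speed, l_max=4)
```

For p = 0.75 the exponent is 1.5, and the counts fall off as 100, 35, 19, 12; for p = 0.59 the exponent is about 1.82, so long wires are rarer still.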


Considering just the wire length distribution while leaving free the actual circuit
topology, placement and routing in the layouts, attaining a logarithmic signal
propagation delay by changing constant wire width to constant aspect ratio for all wires in
a layout can carry a surprisingly severe penalty. This follows immediately from (11), (12),
(13) and (14), and is expressed by the theorem below. The (simple) analysis of this fact,
and the proof of the Theorem, are relegated to the Appendix.

Theorem. Let the original layout area be $A$ and the original number of wires in
the layout be $C$. For the wire length distribution $f(i) = \lfloor c\, i^{-1} \rfloor$ for
$1 \le i \le \sqrt{A}$ and $f(i) = 0$ for $i > \sqrt{A}$, the change from constant wire width to
wires with a constant aspect ratio has the following effect.

(i) Keeping $f$ and $C$ the same, the area has to increase from $A$ to $\exp(\Omega(\sqrt{A}))$.

(ii) Keeping $f$ and $A$ the same, the number of wires (or components) has to decrease
from $C$ to $O(\log C)$.

(iii) Keeping $A$ and $C$ the same, the wire length distribution has to change to
$f'(i) = \lfloor c'\, i^{-(2+\epsilon)} \rfloor$ for some small $\epsilon > 0$ ($1 \le i \le \sqrt{A}$).

We observe that in case (i) of the Theorem the wires get so long that the
logarithmic propagation delay turns out to yield about the same absolute time delay as in
the original wires. In case (ii) of the Theorem matters are probably as bad because the
bit capacity of the chip has been logarithmically reduced. Finally, in case (iii) of the
Theorem the subject circuit topology may not have a layout with the required wire
length distribution.

It therefore appears that only circuits for which there are layouts with wire length
distributions with relatively large $\lambda$ will profit from this scheme for logarithmic signal
propagation delay.
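
The penalty stated in case (ii) of the Theorem can be illustrated numerically. The Python sketch below is a simplification under several assumptions of ours: the floor in the distribution is dropped, the wire area budget is taken to be A/2, the exponent of the distribution is 1, a wire of length i costs area i at unit width and a·i² at constant aspect ratio a, and a = 0.1 is an arbitrary choice. It picks the largest normalization constant that fits the budget and reports the resulting number of wires.

```python
import math

def wire_count(area, lam, cost, budget_fraction=0.5):
    """Number of wires that fit when a wire of length i occupies cost(i).

    The distribution is c * i**(-lam) for 1 <= i <= sqrt(area) (floor omitted);
    c is chosen so that the wires exactly use budget_fraction * area.
    """
    l_max = math.isqrt(area)
    lengths = range(1, l_max + 1)
    weights = [i ** -lam for i in lengths]
    area_per_unit_c = sum(w * cost(i) for i, w in zip(lengths, weights))
    c = budget_fraction * area / area_per_unit_c
    return c * sum(weights)

def unit_width(i):
    """Area of a wire of length i and width 1."""
    return i

def constant_aspect(a):
    """Area function for wires whose width is a times their length."""
    return lambda i: a * i * i

results = {}
for area in (10 ** 4, 10 ** 6):
    results[area] = (
        wire_count(area, 1, unit_width),
        wire_count(area, 1, constant_aspect(0.1)),
    )
```

Under these assumptions the wire count at unit width grows by more than a factor ten when the area grows from 10^4 to 10^6, whereas at constant aspect ratio it grows by less than a factor two, in line with the drop from C to O(log C).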
Appendix

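From (13) and (14) we can estimate the maximal figure for the normalisation constant $c$. For
$\lambda \ne 3$:

$$ c \approx \frac{(A-C)(3-\lambda)}{a\,(A^{(3-\lambda)/2} - 1)} , $$

and for $\lambda = 3$,

$$ c \approx \frac{2(A-C)}{a \log A} . $$

Consequently, for $\lambda \ne 1 \wedge \lambda \ne 3$ by (12):

$$ C \approx \sum_{i=1}^{\sqrt{A}} f(i) \approx \frac{(A-C)(3-\lambda)(A^{(1-\lambda)/2} - 1)}{a\,(1-\lambda)(A^{(3-\lambda)/2} - 1)} , \tag{16a} $$

and for $\lambda = 3$,

$$ C \approx \frac{(A-C)(A-1)}{a A \log A} . \tag{16b} $$

For $\lambda = 1$,

$$ C \approx \sum_{i=1}^{\sqrt{A}} f(i) \approx \frac{(A-C) \log A}{a\,(A-1)} . \tag{16c} $$

(Note: for $\lambda < 1$ we obtain $c < 1$, resulting in $f(i) = 0$ also for small $i$, and $C$ a small constant.)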
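
The estimate for the normalisation constant can be retraced in a few lines. The derivation below is a sketch under two simplifying assumptions that are ours: the floor in (14) is dropped, and the sum over wire lengths is approximated by an integral. The exponent of the distribution is written $\lambda$ and the aspect ratio $a$.

```latex
% Insert (14) into the area bound (13), drop the floor, and replace the sum by an integral:
\sum_{i=1}^{\sqrt{A}} c\, i^{-\lambda}\, a\, i^{2}
  \;\approx\; a\, c \int_{1}^{\sqrt{A}} x^{2-\lambda}\, dx
  \;=\; a\, c\, \frac{A^{(3-\lambda)/2} - 1}{3 - \lambda}
  \;\le\; A - C \qquad (\lambda \ne 3),
% so the largest admissible normalisation constant is
c \;\approx\; \frac{(A - C)(3 - \lambda)}{a \left( A^{(3-\lambda)/2} - 1 \right)} .
% For \lambda = 3 the integrand is 1/x and the integral equals (\log A)/2, giving
c \;\approx\; \frac{2 (A - C)}{a \log A} .
```

The number of wires then follows from (12) by the same approximation applied to the sum of $c\, i^{-\lambda}$, with the integral of $x^{-\lambda}$ replaced by $(\log A)/2$ when $\lambda = 1$.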


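For comparison we give an analogous analysis under the constant wire width assumption.
Then equations (11) - (12) stay the same but equation (13) becomes

$$ \sum_{i=1}^{\sqrt{A}} f(i)\, i \le A - C . \tag{17} $$

Thus, for $f(i) = \lfloor c\, i^{-\lambda} \rfloor$ ($1 \le i \le \sqrt{A}$) and $f(i) = 0$ ($i > \sqrt{A}$) and with
$A$, $C$ and $c$ as above we obtain the following relations. For $\lambda = 1$:

$$ c \approx \frac{A-C}{\sqrt{A} - 1} , \qquad C \approx \frac{(A-C) \log A}{2(\sqrt{A} - 1)} . \tag{18} $$

For $\lambda \ne 1$ and $\lambda \ne 2$:

$$ c \approx \frac{(A-C)(2-\lambda)}{A^{(2-\lambda)/2} - 1} , \qquad C \approx \frac{(A-C)(2-\lambda)(A^{(1-\lambda)/2} - 1)}{(1-\lambda)(A^{(2-\lambda)/2} - 1)} . \tag{19} $$

For $\lambda = 2$:

$$ c \approx \frac{2(A-C)}{\log A} , \qquad C \approx \frac{2(A-C)(\sqrt{A} - 1)}{\sqrt{A} \log A} . $$

(Note: for $\lambda < 0$ we obtain $c < 1$.) For $\lambda > 0$ we have $C \in \Omega(\sqrt{A})$. Thus: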

Proof of Theorem. Since we assume the circuit to be connected we have
$A > A - C > A/2$ in the various equations. We also assume $A \gg 1$.

(i) Equate expression (18) for $C$ with expression (16c) for $C$, with $A'$ substituted for $A$ in
the latter. This yields $\log A' \in \Omega(\sqrt{A})$.

(ii) Substitute $C'$ for $C$ in equation (18) and express $C'$ in terms of $C$ by eliminating $A$
from the resulting equation and (16c).

(iii) Equate expression (18) for $C$ with expression (16a) for $C$ (expressions (16b) and (16c)
contradict (18)). The terms $(A - C)$ on both sides cancel each other. Solving for $\lambda$ yields
$\lambda = 2 + \epsilon(A, a) > 2$ with $\epsilon(A, a) \to 0$ for $A \to \infty$ and $a$ constant. Every
distribution with exponent equal to or larger than this $\lambda$ suffices. □